Abort execution when platform telemetry error #6827
…lemetry errors Signed-off-by: jorgee <jorge.ejarque@seqera.io>
d9fa5cd to d752bc2
Signed-off-by: Jorge Ejarque <jorgee@users.noreply.github.com>
Signed-off-by: Jorge Ejarque <jorgee@users.noreply.github.com>
I have updated this branch to master.
@pditommaso let me know if the latest changes look good to you.

I like the general principle of "plugin observer errors are logged as warnings by default, observer can throw AbortRunException to fail the run".

This PR currently just adds an env var to control whether the TowerClient throws hard or soft errors. But I wonder if we should just decide for each error case whether it should be hard or soft, instead of introducing an environment variable.

For example, if I run with

In any case, let's wait until #6946 is merged, since it refactors the tower client (harder to resolve merge conflicts there).
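The principle quoted in this comment can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual Nextflow code: the `Observer` interface, the `notifyObservers` helper, the in-memory `warnings` list, and the local `AbortRunException` class are all hypothetical stand-ins introduced for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class ObserverNotifyDemo {

    /** Hypothetical stand-in for Nextflow's AbortRunException. */
    static class AbortRunException extends RuntimeException {
        AbortRunException(String message) {
            super(message);
        }
    }

    /** Hypothetical observer interface; not the real plugin API. */
    interface Observer {
        void onEvent(String event);
    }

    /** Collects warnings in memory instead of using a real logger. */
    static final List<String> warnings = new ArrayList<>();

    /**
     * Notifies every observer. Ordinary exceptions are recorded as
     * warnings and the loop continues; an AbortRunException is rethrown
     * so that the caller can fail the run.
     */
    static void notifyObservers(List<Observer> observers, String event) {
        for (Observer observer : observers) {
            try {
                observer.onEvent(event);
            } catch (AbortRunException e) {
                // Hard failure: propagate to the caller.
                throw e;
            } catch (Exception e) {
                // Soft failure: record a warning and keep going.
                warnings.add("Observer failed on '" + event + "': " + e.getMessage());
            }
        }
    }

    public static void main(String[] args) {
        Observer soft = event -> { throw new IllegalStateException("network glitch"); };
        Observer hard = event -> { throw new AbortRunException("cannot reach platform"); };

        notifyObservers(List.of(soft), "workflow-begin");
        System.out.println("Warnings so far: " + warnings);

        try {
            notifyObservers(List.of(hard), "workflow-begin");
        } catch (AbortRunException e) {
            System.out.println("Run aborted: " + e.getMessage());
        }
    }
}
```

With this shape, the observer itself decides which failures are fatal by choosing the exception type, so no global switch is needed in the dispatcher.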
From today's discussion: @jorgee, please update this PR to make all failures related to sending data to Platform hard failures, removing the need for the environment variable.
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
I have removed the environment variable and throw
For the heartbeats, I think keeping them as warnings is reasonable.

For the
I'll try to check what it does in this case. Either way, it will not receive the trace event. The only possibility is that it monitors the head job and uses the process exit code to decide whether the run failed. In that case, I think it is better to warn instead of throwing the AbortRunException.
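One way to picture the "decide per error case" approach discussed in these comments is a small policy function that maps each kind of message to a severity. The sketch below is only an illustration: the `EventKind` and `Severity` enums and the `severityFor` method are hypothetical names, and treating the completion event as a warning simply mirrors the suggestion in the comment above. The thread does not state this as the final behavior.

```java
public class PlatformErrorPolicyDemo {

    /** Hypothetical kinds of messages sent to Platform. */
    enum EventKind { WORKFLOW_BEGIN, PROGRESS, HEARTBEAT, WORKFLOW_COMPLETE }

    /** Hypothetical outcome when sending a message fails. */
    enum Severity { WARN, ABORT }

    /**
     * Assumed policy for illustration only: heartbeat failures stay soft,
     * the completion event is also soft (as suggested in the comment),
     * and every other send failure aborts the run.
     */
    static Severity severityFor(EventKind kind) {
        switch (kind) {
            case HEARTBEAT:
            case WORKFLOW_COMPLETE:
                return Severity.WARN;
            default:
                return Severity.ABORT;
        }
    }

    public static void main(String[] args) {
        for (EventKind kind : EventKind.values()) {
            System.out.println(kind + " -> " + severityFor(kind));
        }
    }
}
```

Keeping the policy in one function makes it easy to review which failures are fatal without an environment variable.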
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
da4827b to 1d06df8
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Confirmed! I ran an execution that skipped the trace completed event. The workflow state on Platform is not undetermined: once the head job finishes, the workflow execution is considered completed, but it is marked as FAILED. Last updates:
This pull request introduces a new mechanism to control error handling behavior in the `TowerClient` class by adding an `abortOnError` flag, which can be set via the environment variable `TOWER_ABORT_ON_ERROR`. When enabled, critical errors encountered while communicating with Seqera Platform will cause the workflow to abort immediately using the `AbortRunException`. The changes also include improved error propagation and additional tests to verify this behavior.

Error Handling Improvements:
- Added an `abortOnError` flag to `TowerClient`, defaulting to `true`, and made it configurable via the `TOWER_ABORT_ON_ERROR` environment variable. This determines whether critical errors abort the workflow or are handled as warnings. [1] [2]
- Updated `TowerClient` methods (`logHttpResponse`, `parseTowerResponse`, and others) to throw `AbortRunException` when `abortOnError` is enabled, ensuring immediate workflow termination on critical errors. [1] [2] [3] [4] [5]

Session and Exception Propagation:
- Updated the `Session` class to specifically catch and log `AbortRunException` during observer notification, ensuring these exceptions propagate and abort the workflow as intended. [1] [2]

Tests:
- Updated `TowerClientTest` to verify the correct detection of the `abortOnError` setting and to ensure that the workflow aborts as expected when errors occur and `abortOnError` is enabled. [1] [2]
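For illustration, a flag of this kind could be resolved from the environment roughly as follows. This is a sketch under assumptions, not the code in this PR: only the variable name `TOWER_ABORT_ON_ERROR` and the default of `true` come from the description, while the `resolveAbortOnError` helper and its parsing rules (anything other than a case-insensitive "true" disables the flag) are hypothetical.

```java
import java.util.Map;

public class AbortOnErrorFlagDemo {

    /** Environment variable name taken from the PR description. */
    static final String ENV_NAME = "TOWER_ABORT_ON_ERROR";

    /**
     * Hypothetical helper: returns true when the variable is unset or
     * empty (the documented default), otherwise parses it as a boolean.
     */
    static boolean resolveAbortOnError(Map<String, String> env) {
        String value = env.get(ENV_NAME);
        if (value == null || value.trim().isEmpty()) {
            return true;
        }
        // Boolean.parseBoolean is true only for "true", ignoring case.
        return Boolean.parseBoolean(value.trim());
    }

    public static void main(String[] args) {
        // Reads the real process environment when run directly.
        System.out.println("abortOnError = " + resolveAbortOnError(System.getenv()));
    }
}
```

Passing the environment in as a map, rather than calling `System.getenv()` inside the helper, keeps the parsing easy to unit test.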